[OpenVINO] Export DFlash for OpenVINO #1756
Open
ofirzaf wants to merge 12 commits into
- Introduced `--dflash-target-model` argument for exporting DFlash draft models.
- Implemented `update_config_for_dflash` to handle DFlash-specific configurations.
- Enhanced model conversion and metadata handling for DFlash models.
- Added `DFlashDummyInputGenerator` for generating dummy inputs specific to DFlash.
- Updated tests to include DFlash model loading and export functionality.

This update enables the export and inference of models utilizing DFlash architecture, enhancing the OpenVINO integration.
- Removed the direct call to `_load_target_weights` in the constructor of `Qwen3DFlashForCausalLM`.
- Added a class method `from_pretrained` to handle loading weights and configurations more effectively.
- Updated weight handling to ensure compatibility with the target data type.
- Modified the `extract_dflash_debug_bundle.py` script to use `dtype` instead of `torch_dtype` and added `attn_implementation` parameter for draft model loading.

These changes improve the model's initialization process and enhance the flexibility of loading configurations.
…dels
- Introduced functions to check and annotate hidden states in models during export.
- Enhanced configuration to include hidden state outputs for models with multiple hidden layers.
- Implemented a test suite to validate hidden state annotations in exported OpenVINO models.

These changes improve the model export process by allowing the inclusion of hidden states, which is essential for certain text generation tasks.
- Implemented helper functions to find and add model outputs based on tensor names.
- Added a new test case to validate that annotated hidden state outputs match those from PyTorch for the GPT-2 model.
- Enhanced the export process to include hidden state outputs, ensuring compatibility with text generation tasks.

These changes improve the testing framework for OpenVINO model exports, specifically focusing on hidden state annotations.
- Added support for overriding the DFlash block size via the environment variable `DFLASH_BLOCK_SIZE_OVERRIDE`.
- Included error handling to ensure the block size is an integer greater than 1.
- This enhancement allows for more flexible configuration of DFlash model exports, improving usability and performance.

These changes contribute to the ongoing improvements in the OpenVINO export process for DFlash models.
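The override described in this commit can be pictured with a minimal sketch. This is not the code from the PR: the helper name `resolve_block_size`, its signature, and the error messages are hypothetical, and only the variable name `DFLASH_BLOCK_SIZE_OVERRIDE` and the "integer greater than 1" rule come from the commit message. The sketch assumes that an unset variable means "keep the block size from the model config".

```python
import os

# Name taken from the commit message; everything else here is illustrative.
ENV_VAR = "DFLASH_BLOCK_SIZE_OVERRIDE"


def resolve_block_size(default_block_size, environ=None):
    """Hypothetical helper: return the block size to use for export.

    Falls back to `default_block_size` when the variable is unset
    (an assumption, the PR may behave differently).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None:
        return default_block_size

    try:
        value = int(raw)
    except ValueError as exc:
        # Non-numeric input, including an empty string, is rejected.
        raise ValueError(
            f"{ENV_VAR} must be an integer greater than 1, got {raw!r}"
        ) from exc

    if value <= 1:
        raise ValueError(
            f"{ENV_VAR} must be an integer greater than 1, got {value}"
        )
    return value
```

Passing the environment as an optional mapping keeps the check easy to test without mutating `os.environ`; in real use the variable would simply be set in the shell before running the export command.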
- Added support for committed prefix cache policy in DFlash models by updating runtime information.
- Modified `DFlashDummyInputGenerator` to use "hidden_states" instead of "target_hidden" for input names.
- Updated Qwen3DFlash model to handle hidden states and past key values more effectively during inference.
- Introduced a new script to compare DFlash cache semantics between original and patched models.
- Enhanced tests to validate the integration of hidden states and ensure consistency in outputs.

These changes improve the functionality and testing of DFlash models within the OpenVINO framework, ensuring better performance and reliability.
What does this PR do?
We implement support for exporting DFlash draft models for speculative decoding with OpenVINO.
We also implement hidden_states annotations in exported OV models to better support methods that require hidden_states as outputs from OV models (such as DFlash and Eagle3). The annotations are applied automatically to all models exported for text generation. The graph itself doesn't change, as these are only annotations.
Commands to export DFlash model with this PR:
optimum-cli export openvino \
  --model z-lab/Qwen3.6-Coder-35B-A3B-DFlash \
  --task text-generation-with-past \
  --trust-remote-code \
  --dflash-target-model Qwen/Qwen3.6-35B-A3B \
  --disable-convert-tokenizer \
  qwen3.6-35b-a3b-dflash-int8-ov

Before submitting